Andres Karjus, Richard A. Blythe, Simon Kirby, Kenny Smith
University of Edinburgh
New Directions in Language Evolution Research @ SLE Tallinn 2018
Works similarly in other diachronic databases of cumulative culture (e.g. click here: movies, boardgames, cookbooks)
A useful baseline to include in any model of diachronic frequency change of linguistic (or other cultural) elements.
For each \(w\) with an original frequency \(f\), and each \(s\), downsampled by randomly relabeling a fixed portion of its occurrences as \(w'\) in the corpus, where the portion is defined as \(e^{ln(f) - s} = f/e^s\) (exluded downsamples with \(<10\) occurrences)
E.g., if \(f=1000\), \(s=0.7\), then \(e^{ ln(1000) - 0.7} \approx 496\), or a -50.3% reduction.
On the heatmaps, \(w\) being closest synonym to \(w'\) is marked as False if this occurs in any of the 10 replicates. The downsampling procedure included a 0-downsample as a sanity check, which entailed no reduction in frequency, i.e. 100% of the occurrences were sampled; these are also displayed on the plots as the first column/value.
The advection value of a word in time period \(t\) is defined as the weighted mean of the changes in frequencies (compared to the previous period) of those associated words. More precisely, the topical advection value for a word \(\omega\) at time period \(t\) is
\[\begin{equation} {\rm advection}(\omega;t) := {\rm weightedMean}\big( \{ {\rm logChange}(N_i;t) \mid i=1,...m \}, \, W \big) \end{equation}\]where \(N\) is the set of \(m\) words associated with the target at time \(t\) and \(W\) is the set of weights (to be defined below) corresponding to those words. The weighted mean is simply
\[\begin{equation} {\rm weightedMean}(X, W) := \frac{\sum x_i w_i }{\sum w_i} \end{equation}\]where \(x_i\) and \(w_i\) are the \(i^{\rm th}\) elements of the sets \(X\) and \(W\) respectively. The log change for period \(t\) for each of the associated words \(\omega'\) is given by the change in the logarithm of its frequencies from the previous to the current period. That is,
\[\begin{equation} {\rm logChange}(\omega';t) := \log[f(\omega';t)+1] - \log[f(\omega';t-1)+1] \end{equation}\]where \(f(\omega';t)\) is the number of occurrences of word \(\omega'\) in the time period \(t\). Note we add \(1\) to these frequency counts, to avoid \(\log(0)\) appearing in the expression.
*This research was supported by the scholarship program Kristjan Jaak, funded and managed by the Archimedes Foundation in collaboration with the Ministry of Education and Research of Estonia.